Abstract

Fusion  and  inference  from  multiple  and  massive  disparate  data  sources  -  the  requirement  for 
our  most  challenging  data  analysis  problems  and  the  goal  of  our  most  ambitious  statistical  pattern 
recognition  methodologies  -  has  many  and  varied  aspects  which  are  currently  the  target  of  intense 
research  and  development.  One  aspect  of  the  overall  challenge  is  manifold  matching  -  identifying 
embeddings  of  multiple  disparate  data  spaces  into  the  same  low-dimensional  space  where  joint 
inference  can  be  pursued.  We  investigate  this  manifold  matching  task  from  the  perspective  of 
jointly  optimizing  the  fidelity  of  the  embeddings  and  their  commensurability  with  one  another, 
with  a  specific  statistical  inference  exploitation  task  in  mind.  Our  results  demonstrate  when  and 
why  our  joint  optimization  methodology  is  superior  to  either  version  of  separate  optimization. 
The  methodology  is  illustrated  with  simulations  and  an  application  in  document  matching. 


1  Introduction 

1.1  Motivation 

Let $(\Xi, \mathcal{F}, \mathcal{P})$ be a probability space, i.e., $\Xi$ is a sample space, $\mathcal{F}$ is a sigma-field, and $\mathcal{P}$ is a probability measure. Consider $K$ measurable spaces $\Xi_1, \dots, \Xi_K$ and measurable maps $\pi_k : \Xi \to \Xi_k$. Each $\pi_k$ induces a probability measure $\mathcal{P}_k$ on $\Xi_k$. We wish to identify a measurable metric space $\mathcal{X}$ (with distance function $d$) and measurable maps $\rho_k : \Xi_k \to \mathcal{X}$, inducing probability measures $\tilde{\mathcal{P}}_k$ on $\mathcal{X}$, so that for $[x_1, \dots, x_K]' \in \Xi_1 \times \cdots \times \Xi_K$ we may evaluate distances $d(\rho_{k_1}(x_{k_1}), \rho_{k_2}(x_{k_2}))$ in $\mathcal{X}$. See Figure 1.

Given $\xi_1, \xi_2$ drawn from $\mathcal{P}$ in $\Xi$, we may reasonably hope that the random variable $d(\rho_{k_1} \circ \pi_{k_1}(\xi_1), \rho_{k_2} \circ \pi_{k_2}(\xi_1))$ is stochastically smaller than the random variable $d(\rho_{k_1} \circ \pi_{k_1}(\xi_1), \rho_{k_2} \circ \pi_{k_2}(\xi_2))$. That is, matched measurements $\pi_{k_1}(\xi_1), \pi_{k_2}(\xi_1)$ representing a single point $\xi_1$ in $\Xi$ are mapped closer to each other than are unmatched measurements $\pi_{k_1}(\xi_1), \pi_{k_2}(\xi_2)$ representing two different points in $\Xi$. This property allows inference to proceed in the common representation space $\mathcal{X}$.

*Corresponding Author: Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218-2682; cep@jhu.edu. This work is partially supported by National Security Science and Engineering Faculty Fellowship (NSSEFF), Air Force Office of Scientific Research (AFOSR), Office of Naval Research (ONR), Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE), and the American Society for Engineering Education (ASEE) Sabbatical Leave Program.

However, we do not observe $\xi \in \Xi$; we also do not observe the $x_k = \pi_k(\xi) \in \Xi_k$ directly, nor do we have knowledge of the maps $\pi_k$. But suppose we have access to functions $\delta_k : \Xi_k \times \Xi_k \to \mathbb{R}_+ = [0, \infty)$ such that $\delta_k(\pi_k(\xi_1), \pi_k(\xi_2))$ represents the “dissimilarity” of outcomes $\xi_1$ and $\xi_2$ under map $\pi_k$. We propose to use sample dissimilarities for matched data in the disparate spaces $\Xi_k$ to simultaneously learn maps $\rho_k$ which allow for a powerful test of matchedness in the common representation space $\mathcal{X}$.


Figure 1: Maps $\pi_k$ induce disparate data spaces $\Xi_k$ from “object space” $\Xi$. Manifold matching involves using matched data $\{x_{ik}\}$ to simultaneously learn maps $\rho_1, \dots, \rho_K$ from disparate spaces $\Xi_1, \dots, \Xi_K$ to a common “representation space” $\mathcal{X}$, for subsequent inference.


1.2  Problem  Formulation 

Consider  n  objects  each  measured  under  K  different  conditions, 

$$x_{i1} \sim \cdots \sim x_{ik} \sim \cdots \sim x_{iK}, \quad i = 1, \dots, n,$$

where $x_{i1} \sim \cdots \sim x_{ik} \sim \cdots \sim x_{iK}$ denotes $K$ matched measurements $\pi_1(\xi_i), \dots, \pi_K(\xi_i)$ representing a single object $\xi_i \in \Xi$, where $\Xi$ denotes the “object space”. The assumption of $K$ different conditions implies that $x_{ik} \in \Xi_k$ where the spaces $\Xi_1, \dots, \Xi_K$ cannot be assumed to be similar. We are given $K$ new measurements $\{y_k\}_{k=1}^K$, $y_k \in \Xi_k$. The question under consideration is: Does the collection $\{y_k\}_{k=1}^K$ also correspond to matched measurements representing a single object measured under the $K$ conditions?

We use the $\Xi$ notation to remind the reader that the spaces $\Xi_k$ cannot be assumed to be standard finite-dimensional Euclidean spaces. We do assume that each space $\Xi_k$ comes with a within-condition dissimilarity $\delta_k$ - a hollow, symmetric function from $\Xi_k \times \Xi_k$ to $\mathbb{R}_+$ - through which the matched data $\{x_{ik}\}$ yields $n \times n$ dissimilarity matrices $\Delta_k$, $k = 1, \dots, K$. For new measurements $\{y_k\}_{k=1}^K$ we have available for each $k$ the within-condition dissimilarities $\delta_k(y_k, x_{ik})$, $i = 1, \dots, n$.

Remark 1: The $x_{ik}$ and $y_k$ are introduced mainly for symbolic purposes; the corresponding
data  may  not  be  available  or  may  be  too  complex  to  use  directly,  and  we  proceed  from  the 
dissimilarities. 

The  specific  statistical  inference  exploitation  task  we  consider  throughout  most  of  this  article 
is hypothesis testing. Our goal, simplified for the case $K = 2$, is to determine whether $y_1$ and $y_2$ are a match. That is,

$$H_0 : y_1 \sim y_2 \quad \text{versus} \quad H_A : y_1 \nsim y_2,$$

or equivalently,

$$H_0 : y_1 = \pi_1(\xi),\ y_2 = \pi_2(\xi) \quad \text{versus} \quad H_A : y_1 = \pi_1(\xi),\ y_2 = \pi_2(\xi') \ \text{for } \xi \neq \xi'.$$

(We control the probability of missing a true match.)

1.3  Manifold  Matching 

We define manifold matching as simultaneous manifold learning and manifold alignment - identifying embeddings of multiple disparate data sources into the same low-dimensional space where joint inference can be pursued. Figure 1 depicts our framework. Conditional distributions are induced by maps $\pi_k$ from “object space” $\Xi$. Our assumption is that the conditional spaces $\Xi_k$ are not commensurate. For example, if the elements of $\Xi$ are individual people, then a photograph in image space $\Xi_1$ and a biographical sketch in text document space $\Xi_2$ are not to be directly compared. Indeed, our fundamental premise defining disparate data sources is that the various $\Xi_k$ cannot profitably be treated as replicates of the same kind of space. Rather, the various spaces are different not just in degree but in kind. Each dissimilarity $\delta_k$ has been tailored for application to $\Xi_k$, and it is inappropriate to apply $\delta_k$ on $\Xi_{k'} \times \Xi_{k'}$ for $k' \neq k$. This distinguishes our data fusion from conventional multivariate analysis.

In Figure 1, matched points $\{x_{ik}\}$ are used to simultaneously learn appropriate maps $\rho_k$ taking the disparate data from the various $\Xi_k$ into a common representation space $\mathcal{X}$. These maps are then applied to $\{y_k\}_{k=1}^K$ yielding $\tilde{y}_k = \rho_k(y_k)$, whence (for $K = 2$) we use $T = d(\tilde{y}_1, \tilde{y}_2)$ as our test statistic and reject for $T$ “large”.

Remark 2: Our convention is to use the tilde notation (e.g., $\tilde{y}_k$) for points in the target space $\mathcal{X}$, contrasted with no tilde for points in the original $\Xi_k$ spaces.

Remark 3: We will throughout consider the special case of $\mathcal{X} = \mathbb{R}^m$ for some pre-specified
target  dimension  m.  The  fundamentally  important  and  challenging  task  of  choosing  the  target 
dimension  -  model  selection  -  will  be  considered  only  as  a  confounding  issue  in  this  paper;  m 
is  a  nuisance  parameter  which  must  be  selected  but  whose  selection  is  beyond  the  scope  of  this 
manuscript. 

1.4  What  are  these  “conditions”  and  what  does  “matched”  mean? 

As suggested above, one example of “conditions” involves photographs $\{x_{i1}\}$ and biographical sketches $\{x_{i2}\}$, with “matched” $x_{i1} \sim x_{i2}$ meaning that the photograph $x_{i1}$ and the biographical sketch $x_{i2}$ are of the same person.

Other illustrative examples include: a general image & caption scenario, with “matched” meaning that they go together; multiple languages for text documents, with “matched” meaning on the same topic; multiple modalities for photographs (e.g., indoor lighting vs outdoor lighting, two cameras of different quality, or passport photos and airport surveillance photos), with “matched” meaning of the same person; Wikipedia text document and Wikipedia hyperlink structure, with “matched” meaning of the same document. More generally, our framework may
be  applicable  to  any  scenario  in  which  multiple  dissimilarity  measures  are  applied  to  the  objects 
at  hand. 

Fundamentally,  “matched”  means  whatever  the  training  data  say  it  means.  We  know  it  when 
we  see  it  -  or,  perhaps  more  accurately,  we  know  unmatched  when  we  see  it;  see  Figure  2. 
Consider,  for  instance,  an  example  of  multiple  languages  for  text  documents,  with  “matched” 
meaning  on  the  same  topic.  Given  English  and  French  Wikipedia  documents  with  the  matching 
provided  by  Wikipedia  itself,  “matched”  means  “on  the  same  topic.”  But  of  course  the  Wikipedia 
documents  are  not  direct  translations  of  one  another,  and  documents  in  different  languages  on 
the  same  topic  may  have  significant  conceptual  differences  due  to  cultural  differences,  etc. 


[Figure 2 image: a bilingual road sign. English text: “No entry for heavy goods vehicles. Residential site only”. Welsh text: “Nid wyf yn y swyddfa ar hyn o bryd. Anfonwch unrhyw waith i'w gyfieithu.”]


Figure  2:  An  example  of  “not  matched”  for  multi-lingual  text  documents.  The  English  is  clear 
enough  to  lorry  drivers  —  but  the  Welsh  reads  “I  am  not  in  the  office  at  the  moment.  Send  any  work 
to be translated.” (See http://news.bbc.co.uk/2/hi/uk_news/wales/7702913.stm; permission
obtained  from  http://www.golwg360.com/Hafan/default.aspx.) 


1.5  Dirichlet  Setting 

While  the  matched  training  data  ultimately  determine  what  “matched”  means,  in  order  to  provide 
a  clear  mathematical  characterization  of  matchedness  we  consider  an  illustrative  Dirichlet  setting. 
This  setting  is  clearly  overly  simplified,  but  it  invokes  some  aspects  of  the  foregoing  example  of 
multiple  languages  for  text  documents. 

Let $S^p = \{x \in \mathbb{R}_+^{p+1} : \sum_{\ell} x_\ell = 1\}$ be the standard $p$-simplex. We consider here the case $\Xi_1 = S^p$ and $\Xi_2 = S^p$ - the two spaces are, in fact, commensurate in this case, for illustration. Let $\gamma_i \sim \mathrm{Dirichlet}(\mathbf{1})$, $i = 1, \dots, n$, represent $n$ “objects” or “topics”. Let $X_{ik} \sim \mathrm{Dirichlet}(r\gamma_i + \mathbf{1})$ represent document $i$ in language $k$. (Since the $X_{ik}$ take their value in $S^p$, we can think of them as modelling (normalized) word count histograms with $p+1$ distinct words. $\Xi_1 = \Xi_2 = S^p$ suggests a simplified 1-1 word correspondence model. A permutation $\sigma$ indicating that the 1-1 word correspondence is unknown may be applied to the dimensions of one space with no alteration to our illustration.) In this case, $r$ controls what it means to be matched - e.g., by analogy with document translation quality. If $r$ is large (highly accurate translations), then matched documents $X_{i1}$ and $X_{i2}$ will be probabilistically more similar than $X_{i1}$ and $X_{i'2}$ for $i \neq i'$; if $r$ is small (rough translations), then “matched” doesn’t mean much. Indeed, the limiting case of $r \to \infty$ (point masses) yields “matched” means “identical” while $r = 0$ (recall that $\mathrm{Dirichlet}(\mathbf{1})$ is uniform on the simplex) yields “matched” means “no relationship”. Figure 3, with $p = 2$, provides an illustration wherein matched means quite a lot. A real data version of this setting with multiple documents per topic is depicted in Figure 4, where three Linguistic Data Consortium (LDC) Enron email message topic classes are projected into the simplex $S^2$ via Fisher’s Linear Discriminant composed with Latent Semantic Analysis (FLDoLSA) (see, e.g., [1, 2, 3]).
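The Dirichlet setting can be simulated in a few lines. The following is a minimal sketch, assuming NumPy; the function name `sample_matched_pairs`, the seed, and the use of Euclidean distance as the dissimilarity are illustrative assumptions and are not taken from the paper. It compares the average distance between matched documents with the average distance between unmatched documents.

```python
import numpy as np

def sample_matched_pairs(n, p, r, rng):
    """Draw n topics and one matched document pair per topic on the p-simplex.

    Hypothetical helper: gamma_i ~ Dirichlet(1) and, independently for
    k = 1, 2, X_ik ~ Dirichlet(r * gamma_i + 1).
    """
    gamma = rng.dirichlet(np.ones(p + 1), size=n)
    x1 = np.vstack([rng.dirichlet(r * g + 1.0) for g in gamma])
    x2 = np.vstack([rng.dirichlet(r * g + 1.0) for g in gamma])
    return gamma, x1, x2

rng = np.random.default_rng(0)
gamma, x1, x2 = sample_matched_pairs(n=200, p=2, r=100.0, rng=rng)

# Matched pairs share a topic; rolling x2 by one row pairs each document
# with a document generated from a different topic (unmatched).
matched = np.linalg.norm(x1 - x2, axis=1).mean()
unmatched = np.linalg.norm(x1 - np.roll(x2, 1, axis=0), axis=1).mean()
print(matched, unmatched)
```

With a large value of r the matched average should be clearly smaller than the unmatched one; lowering r toward 0 should shrink the gap.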


Figure 3: [...] = 10 in languages $k = 1$ and $k = 2$ in the standard 2-simplex $S^2$. The parameter $r$ controls the meaning of matchedness - the similarity of matched documents $X_{i1}$ and $X_{i2}$ compared to unmatched documents $X_{i1}$ and $X_{i'2}$ for $i \neq i'$.


1.6  Related  Work 

The  2006  David  Hand  polemic  [4]  argued  persuasively  that  a  fundamental  issue  in  statistical 
inference  research  and  development  -  perhaps  the  fundamental  issue  -  is  robustness  in  the  face 
of  test  data  drawn  from  a  distribution  not  the  same  as  the  distribution  from  which  the  training 
data  are  drawn.  The  disparate  information  fusion  described  above  -  combining  multiple  spaces 
with  different  characteristics  -  provides  a  setting  for  investigation  of  related  issues.  The  recent 
survey  [5]  considers  a  wide  range  of  examples  and  methodologies  addressing  this  phenomenon  in 
terms  of  transfer  learning,  domain  adaptation,  multitask  learning,  etc.  The  recent  special  issue 
[6]  is  devoted  entirely  to  dimensionality  reduction  via  subspace  and  submanifold  learning.  The 
majority  of  this  article  considers  the  Neyman-Pearson  hypothesis  testing  setting,  which  provides 
clarity  through  the  most  straightforward  of  inference  tasks.  In  Section  5.2  we  briefly  consider  a 
ranking  task. 

Our dissimilarity-centric approach is motivated by the 2005 Pekalska and Duin book [7] on the dissimilarity representation for pattern recognition and the far-reaching success of multidimensional scaling methodologies [8, 9, 10, 11].


Figure 4: An example considering the FLDoLSA projection into $S^2$ of multiple Enron email messages identified with three Linguistic Data Consortium (LDC) topics. The three colored scatterplots - yellow, red, purple - represent documents from the three topics; the green dots represent the topic means. We see that “matched”, meaning “on the same topic”, does mean something quite like $\mathrm{Dirichlet}(r\gamma_{\mathrm{topic}} + \mathbf{1})$ in this case (but the variability “r” may be topic-dependent).


Combining  information  from  disparate  data  sources  when  the  information  in  the  various 
spaces  is  fundamentally  incommensurate  -  that  is,  a  separate  collection  of  useful  features  can 
be  extracted  from  each  space  but  their  interpoint  geometry  precludes  profitable  alignment  in  a 
common  space  -  is  considered  via  Cartesian  product  space  embedding  in  [12]. 

Preliminary  development  of  our  joint  optimization  methodology  presented  herein,  as  well  as 
an  application  to  classification  tasks,  is  presented  in  [13]. 

1.7  Summary 

In Section 2 we frame the problem as an optimization problem and lay the groundwork for the methodologies proposed in Section 3. Section 4 presents instructive simulations that illustrate the characteristic behavior of these methodologies; in particular, a simulation involving Dirichlet random variables sets the stage for the experimental examples on text documents presented in Section 5. Finally, Section 6 provides discussion and suggestions for several areas of continuing research.


2  Fidelity  and  Commensurability 

As suggested in Figure 1, our goal is to identify maps $\rho_k$ taking $\Xi_k$ to $\mathbb{R}^m$ (for some pre-specified $m$) such that (for $K = 2$) the power of the test, $P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A : y_1 \nsim y_2]$, is large, where the critical value $c_\alpha$ is determined by the null distribution of the test statistic and the allowable Type I error level $\alpha$.

We proceed using $\ell_2$ error for convenience and simplicity; clearly there is ample reason to consider other error criteria for particular applications. Similarly, we will assume symmetric dissimilarities $\delta_k$.

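The available matched points $\{x_{ik}\}$ are used to identify appropriate maps $\rho_k$. Fidelity is how well the mapping $x_{ik} \mapsto \tilde{x}_{ik}$ preserves original dissimilarities. The within-condition squared fidelity error is given by

$$\epsilon_{f_k} = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \left( d(\tilde{x}_{ik}, \tilde{x}_{jk}) - \delta_k(x_{ik}, x_{jk}) \right)^2$$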

for  each  k.  If  the  fidelity  error  is  large,  then  it  is  likely  that  the  mapping  does  not  capture  aspects 
of  original  data  that  may  be  needed  for  inference. 

On the other hand, even if all fidelity errors are small, inference may fail if $d(\tilde{y}_1, \tilde{y}_2)$ is large under the “matched” null hypothesis $H_0 : y_1 \sim y_2$. Commensurability is how well the mappings preserve matchedness; the between-condition squared commensurability error is given by

$$\epsilon_{c_{k_1 k_2}} = \frac{1}{n} \sum_{1 \le i \le n} \left( d(\tilde{x}_{ik_1}, \tilde{x}_{ik_2}) - \delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) \right)^2.$$

Alas, $\delta_{k_1 k_2}$ does not exist - we have no dissimilarity on $\Xi_{k_1} \times \Xi_{k_2}$. However, the concept of “matchedness” suggests that it might be reasonable to set $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) = 0$ for all $i, k_1, k_2$, in which case the commensurability error is the mean squared distance between matched points - the same criterion optimized by the Procrustes matching employed below.
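The two error measures defined so far are straightforward to compute for a given pair of embeddings. The sketch below assumes NumPy and Euclidean distance in the target space; the function names are hypothetical, and the commensurability error uses the zero stand-in for the between-condition dissimilarity of matched points.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def fidelity_error(x_tilde, delta):
    """Mean over pairs i < j of (d(x~_i, x~_j) - delta_ij)^2."""
    iu = np.triu_indices(delta.shape[0], k=1)
    d = pairwise_dist(x_tilde, x_tilde)
    return float(((d[iu] - delta[iu]) ** 2).mean())

def commensurability_error(x1_tilde, x2_tilde):
    """Mean squared distance between matched embedded points
    (between-condition dissimilarity of matched points set to 0)."""
    return float((np.linalg.norm(x1_tilde - x2_tilde, axis=1) ** 2).mean())

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 2))
delta = pairwise_dist(x, x)            # dissimilarities reproduced exactly by x
shifted = x + np.array([3.0, 4.0])     # same shape, different location

print(fidelity_error(shifted, delta))          # fidelity ignores the shift
print(commensurability_error(x, shifted))      # commensurability does not
```

The toy example shows why both criteria matter: a translated copy of a configuration has perfect fidelity yet a large commensurability error.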

There is also between-condition squared separability error given by

$$\epsilon_{s_{k_1 k_2}} = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \left( d(\tilde{x}_{ik_1}, \tilde{x}_{jk_2}) - \delta_{k_1 k_2}(x_{ik_1}, x_{jk_2}) \right)^2.$$

However, it is less clear how to identify a reasonable stand-in for the $\delta_{k_1 k_2}$ terms in this expression. We will return to this issue when presenting our joint optimization inference methodology proposal in Section 3.3 below.

If  all  these  errors  are  small  -  and  if  the  target  dimensionality  is  low  enough  so  that  estimation 
variance  does  not  dominate  (see  e.g.  [14]  Section  3  and  [15]  Figure  12.1)  -  then  successful  inference 
in  the  target  space  may  be  achievable.  The  idea  of  the  joint  optimization  method  proposed  in 
this  manuscript  (Section  3.3)  is  to  attempt  to  minimize  all  three  of  these  errors  simultaneously. 


3  Inference  Methodologies 

In this section we present three methodologies for performing our manifold matching inference - one which focuses on fidelity and is based on multidimensional scaling and Procrustes matching, one which focuses on commensurability and is based on canonical correlation analysis, and then our proposal for joint optimization of fidelity and commensurability.

Before  proceeding,  we  briefly  review  multidimensional  scaling,  Procrustes  matching,  and 
canonical  correlation  analysis. 

Multidimensional scaling (MDS) takes an $n \times n$ dissimilarity matrix $\Delta = [\delta_{ij}]$ and produces a configuration of $n$ points $\tilde{x}_1, \dots, \tilde{x}_n$ in a target metric space endowed with distance function $d$ such that the collection $\{d(\tilde{x}_i, \tilde{x}_j)\}$ agrees as closely as possible with the original $\{\delta_{ij}\}$ under some specified error criterion; see for instance [8, 9, 10, 11]. For example, $\ell_2$ (also known as “raw stress”) MDS minimizes $\sum_{1 \le i < j \le n} (d(\tilde{x}_i, \tilde{x}_j) - \delta_{ij})^2$.
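One standard way to minimize raw stress is iterative majorization (the SMACOF algorithm with unit weights). The following is a minimal sketch under that assumption, using NumPy only; it is not claimed to be the optimizer used for the results in this paper, and the function names and defaults are illustrative.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(x, delta):
    """Sum over i < j of (d(x_i, x_j) - delta_ij)^2."""
    iu = np.triu_indices(delta.shape[0], k=1)
    return float(((pairwise_dist(x)[iu] - delta[iu]) ** 2).sum())

def raw_stress_mds(delta, m, x0=None, n_iter=300, seed=0):
    """Embed an n x n dissimilarity matrix into R^m by repeated
    Guttman transforms (unit weights); each step cannot increase raw stress."""
    n = delta.shape[0]
    if x0 is None:
        x = np.random.default_rng(seed).standard_normal((n, m))
    else:
        x = x0.copy()
    for _ in range(n_iter):
        d = pairwise_dist(x)
        # ratio_ij = delta_ij / d_ij, defined as 0 where d_ij = 0
        ratio = np.divide(delta, d, out=np.zeros_like(d), where=d > 0)
        b = -ratio
        b[np.diag_indices(n)] = ratio.sum(axis=1)
        x = b @ x / n
    return x

rng = np.random.default_rng(1)
truth = rng.standard_normal((15, 2))
delta = pairwise_dist(truth)
x0 = rng.standard_normal((15, 2))
x_hat = raw_stress_mds(delta, m=2, x0=x0)
print(raw_stress(x0, delta), raw_stress(x_hat, delta))
```

Majorization only guarantees a non-increasing stress sequence, so in practice several random starts (or a classical MDS start) may be worth trying.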

Out-of-sample embedding is used throughout this paper - given a configuration $\{\tilde{x}_i\}_{i=1}^n$ of the training observations and dissimilarities between test observations and the training observations, the test points are embedded into the existing configuration so as to be as $\ell_2$-consistent as possible
with  these  dissimilarities.  This  out-of-sample  embedding  can  be  one  at  a  time,  or  jointly  if  the 
dissimilarities  among  multiple  test  observations  are  also  available.  Trosset  and  Priebe  [16]  present 
the  out-of-sample  methodology  appropriate  for  classical  MDS  embeddings.  We  use  raw  stress 
embeddings  herein,  and  the  appropriate  corresponding  out-of-sample  methodology  is  presented 
in  [17]. 
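For a single test point, out-of-sample embedding under the raw stress criterion amounts to choosing the location whose distances to the fixed training configuration best match the observed dissimilarities. The sketch below is one simple way to do this (a majorization update with the training points held fixed); it is an illustrative assumption, not the procedure of [17], and the function names are hypothetical.

```python
import numpy as np

def oos_stress(y, x_tilde, u):
    """Sum over i of (||y - x~_i|| - u_i)^2 for a candidate location y."""
    return float(((np.linalg.norm(x_tilde - y, axis=1) - u) ** 2).sum())

def oos_embed(x_tilde, u, n_iter=500):
    """Embed one new point given its dissimilarities u to the n fixed
    training points x_tilde (n x m). Each update cannot increase oos_stress."""
    center = x_tilde.mean(axis=0)
    y = center + 1e-3                      # start just off the centroid
    for _ in range(n_iter):
        diff = y - x_tilde
        dist = np.maximum(np.linalg.norm(diff, axis=1), 1e-12)
        # minimizer of the quadratic majorizer of the stress at the current y
        y = center + (u[:, None] * diff / dist[:, None]).mean(axis=0)
    return y

rng = np.random.default_rng(0)
x_tilde = rng.standard_normal((30, 2))
y_true = np.array([0.3, -0.2])
u = np.linalg.norm(x_tilde - y_true, axis=1)   # noiseless dissimilarities
y_hat = oos_embed(x_tilde, u)
print(y_hat, oos_stress(y_hat, x_tilde, u))
```

Embedding several test points jointly would additionally use the dissimilarities among the test points, which this single-point sketch ignores.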

Procrustes matching [18, 19, 20, 21] takes two matched collections $\tilde{X}_1$ and $\tilde{X}_2$ of $n$ points in $\mathbb{R}^m$ and finds the rigid motion transformation which optimally aligns the two collections. For example, $\ell_2$ Procrustes minimizes the Frobenius norm $\|\tilde{X}_1 - \tilde{X}_2 Q\|_F$ over all $m \times m$ matrices $Q$ such that $Q^T Q = I$. (We assume the dissimilarities have been scaled so that a scaling is not required in the Procrustes mapping. Thus $Q$ defines a rigid motion mapping $\tilde{X}_2$ “onto” $\tilde{X}_1$. We address this issue briefly in Section 6.)
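The constrained minimization over $Q$ has a well-known closed form via the singular value decomposition. A minimal NumPy sketch follows; the function name is hypothetical, and, as in the text, no scaling or translation is estimated.

```python
import numpy as np

def procrustes_rotation(x1, x2):
    """Return the orthogonal Q minimizing ||x1 - x2 @ Q||_F.

    With x2.T @ x1 = U S V^T, the minimizer is Q = U V^T.
    """
    u, _, vt = np.linalg.svd(x2.T @ x1)
    return u @ vt

rng = np.random.default_rng(0)
x2 = rng.standard_normal((20, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
x1 = x2 @ rot                      # x1 is an exactly rotated copy of x2
q = procrustes_rotation(x1, x2)
print(np.linalg.norm(x1 - x2 @ q))
```

Because only $Q^T Q = I$ is imposed, the solution may include a reflection as well as a rotation.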

Canonical correlation analysis (CCA) takes a collection $X_1$ of $n_1$ points in $\mathbb{R}^{m_1}$ and a collection $X_2$ of $n_2$ points in $\mathbb{R}^{m_2}$ and finds the pair of linear maps $U_1 : \mathbb{R}^{m_1} \to \mathbb{R}$ and $U_2 : \mathbb{R}^{m_2} \to \mathbb{R}$ which maximizes the correlation between $\tilde{X}_1 = U_1(X_1)$ and $\tilde{X}_2 = U_2(X_2)$. Performing $m$ iterations of this procedure in the successive orthogonal subspaces yields a CCA procedure which maps to $\mathbb{R}^m$. See, for instance, [22, 23, 24].
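All $m$ canonical directions can be obtained at once from a singular value decomposition of the whitened cross-covariance. The sketch below assumes NumPy and row-matched collections with the same number of points; the small ridge term `reg` is an added assumption for numerical stability and is not part of the description above.

```python
import numpy as np

def inv_sqrt(s, reg):
    """Inverse symmetric square root of a covariance matrix (ridge-regularized)."""
    w, v = np.linalg.eigh(s + reg * np.eye(s.shape[0]))
    return (v / np.sqrt(w)) @ v.T

def cca(x1, x2, m, reg=1e-8):
    """Return linear maps U1 (m1 x m), U2 (m2 x m) and the top m canonical
    correlations for row-matched collections x1 (n x m1) and x2 (n x m2)."""
    n = x1.shape[0]
    x1c = x1 - x1.mean(axis=0)
    x2c = x2 - x2.mean(axis=0)
    s11, s22, s12 = x1c.T @ x1c / n, x2c.T @ x2c / n, x1c.T @ x2c / n
    w1, w2 = inv_sqrt(s11, reg), inv_sqrt(s22, reg)
    a, corr, bt = np.linalg.svd(w1 @ s12 @ w2)
    return w1 @ a[:, :m], w2 @ bt.T[:, :m], corr[:m]

rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)                          # shared latent signal
x1 = np.column_stack([z + 0.1 * rng.standard_normal(n),
                      rng.standard_normal(n), rng.standard_normal(n)])
x2 = np.column_stack([rng.standard_normal(n),
                      z + 0.1 * rng.standard_normal(n)])
u1, u2, corr = cca(x1, x2, m=1)
proj1 = (x1 - x1.mean(axis=0)) @ u1[:, 0]
proj2 = (x2 - x2.mean(axis=0)) @ u2[:, 0]
empirical = np.corrcoef(proj1, proj2)[0, 1]
print(corr[0], empirical)
```

When the input dimension is close to the number of points, the sample covariances become nearly singular and the leading correlations approach 1 regardless of any true relationship, which is worth keeping in mind for the discussion of spurious correlation in Section 4.4.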

Let  us  now  consider  these  tools  as  building  blocks  for  manifold  matching  inference. 

3.1  Procrustes ∘ MDS

Multidimensional scaling yields low-dimensional embeddings. That is, $\Delta_1 \mapsto \tilde{X}_1$ and $\Delta_2 \mapsto \tilde{X}_2$ yields $n \times m$ configurations. Procrustes$(\tilde{X}_1, \tilde{X}_2)$ yields

$$Q^* = \arg\min_{Q^T Q = I} \|\tilde{X}_1 - \tilde{X}_2 Q\|_F.$$

Given $\delta_k(y_k, x_{ik})$, $i = 1, \dots, n$ for $k = 1, 2$, out-of-sample embedding of the test data gives $y_1 \mapsto \tilde{y}_1$, $y_2 \mapsto \tilde{y}_2'$ where the embedded points are chosen so that their distances to $\tilde{x}_{ik}$ agree as closely as possible with the available dissimilarities. Using the rigid motion transformation obtained in the Procrustes step, both $\tilde{y}_1$ and $\tilde{y}_2 = ((\tilde{y}_2')^T Q^*)^T$ are in $\mathbb{R}^m$ with the same coordinate system. Thus inference may proceed by rejecting for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this separate embedding approach “Procrustes composed with multidimensional scaling”, or “pom”.

From  an  inspection  of  the  raw  stress  multidimensional  scaling  criterion  function,  it  follows 
immediately that the $\Delta_k \mapsto \tilde{X}_k$ mappings minimize fidelity error. Thus we have established the
following  result: 


Theorem  1:  pom  optimizes  fidelity  without  regard  for  commensurability. 


That is, the maps $\rho_k$ are identified separately, with no concern for whether the commensurability optimization in the Procrustes step will be able to provide a good alignment.

3.2  Canonical  Correlation 

Since canonical correlation begins with Euclidean data, the first step of this methodology necessarily involves multidimensional scaling. This appears similar to Procrustes ∘ MDS above, but in this case no attempt is made to achieve meaningful dimensionality reduction. Multidimensional scaling yields high-dimensional embeddings, $\Delta_1 \mapsto X_1'$ and $\Delta_2 \mapsto X_2'$; here these maps are to the highest-dimensional space possible, $\mathbb{R}^{n-1}$ in general. Canonical correlation finds linear maps to $\mathbb{R}^m$, $U_1 : X_1' \mapsto \tilde{X}_1$ and $U_2 : X_2' \mapsto \tilde{X}_2$, to maximize correlation. Again, out-of-sample embedding yields $(n-1)$-dimensional points $y_1 \mapsto y_1'$, $y_2 \mapsto y_2'$. Then $\tilde{y}_1 = U_1^T y_1'$ and $\tilde{y}_2 = U_2^T y_2'$ can be directly compared. An investigation of the correlation criterion function shows that the CCA maps $U_1$ and $U_2$ minimize commensurability error, subject to linearity. Thus there is no need for Procrustes in this case, and once again inference may proceed: reject for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this approach “cca”.

From  the  equivalence  of  the  correlation  objective  function  and  commensurability  error,  we 
have  established  the  following  result: 

Theorem  2:  cca  optimizes  commensurability  without  regard  for  fidelity. 

That is, the maps $\rho_k$ are identified jointly, but with no concern for fidelity of the individual
embeddings  (beyond  linearity). 

3.3  Omnibus  Embedding 

In response to the optimization objectives of the two methodologies presented above - one considering fidelity only and the other considering commensurability only - we develop an omnibus embedding methodology explicitly focused on the joint optimization of fidelity and commensurability.


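[Figure 5 image: the $2n \times 2n$ omnibus matrix in block form,

$$M = \begin{bmatrix} \Delta_1 & W \\ W^T & \Delta_2 \end{bmatrix},$$

with $n \times n$ blocks, bordered by the $n \times 1$ out-of-sample dissimilarity vectors $u_1, v_1$ (for $y_1$) and $u_2, v_2$ (for $y_2$).]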

Figure 5: Depiction of the $2n \times 2n$ omnibus dissimilarity matrix $M$, including imputed dissimilarities $W = [\delta_{12}(x_{i1}, x_{j2})]$ and out-of-sample test data $y_1$, $y_2$.
Under the “matched” assumption, we impute dissimilarities $W = [\delta_{12}(x_{i1}, x_{j2})]$ to obtain a $2n \times 2n$ omnibus dissimilarity matrix $M$. See Figure 5, which depicts $M$ as a block matrix consisting of the $n \times n$ dissimilarity matrices $\Delta_1$ and $\Delta_2$ on the diagonal and $W$ as the $n \times n$ off-diagonal block. (This generalizes immediately to $K > 2$.) As discussed above, it seems reasonable under $H_0$ to set the diagonal elements $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2})$ of $W$ to zero. (Notice, however, that $\delta_{k_1 k_2}(x_{ik_1}, x_{ik_2}) = 0$ for $k_1 \neq k_2$ is not necessarily “truth.” For instance, the Dirichlet setting of Section 1.5 with $r < \infty$ would have non-zero elements for $\mathrm{diag}(W)$. Still, this “shrinkage” of $\mathrm{diag}(W)$ to zero seems reasonable.) As for the off-diagonal elements of $W$, we argue that either leaving them as missing data unused in the subsequent optimization or letting $W = (\Delta_1 + \Delta_2)/2$ are reasonable suggestions; we will return to this imputation issue later. Once we have settled on $W$, our approach considers MDS embedding of $M$ as $2n$ points in $\mathbb{R}^m$ - zeros on the diagonal of $W$ act to force matched points to be embedded near each other. It is clear that raw stress MDS applied to $M$ has as its objective function precisely $\epsilon_{f_1} + \epsilon_{f_2} + \epsilon_{c_{12}} + \epsilon_{s_{12}}$. If $\mathrm{diag}(W) = 0$ and the off-diagonal elements are treated as missing and ignored in the optimization, then this objective function reduces to a consideration of just fidelity and commensurability.

Let $u_{i1} = \delta_1(y_1, x_{i1})$ and $v_{i2} = \delta_2(y_2, x_{i2})$. Under $H_0$, impute $v_{i1} = \delta_{12}(y_1, x_{i2})$ and $u_{i2} = \delta_{12}(y_2, x_{i1})$ via $v_1 = u_2 = (u_1 + v_2)/2$. Out-of-sample embedding of $(u_1^T, v_1^T)^T$ and $(u_2^T, v_2^T)^T$ yields $\tilde{y}_1$ and $\tilde{y}_2$. Reject for large values of $d(\tilde{y}_1, \tilde{y}_2)$. We dub this omnibus embedding approach for joint optimization of fidelity and commensurability “jofc”.
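The construction of the omnibus matrix and of the imputed out-of-sample rows can be sketched as follows, assuming NumPy. The function names and the boolean `used` mask (marking which entries would enter the MDS criterion) are illustrative assumptions; the embedding step itself is not shown.

```python
import numpy as np

def omnibus_matrix(delta1, delta2, impute="average"):
    """Assemble the 2n x 2n omnibus matrix M and a mask of entries to use.

    impute="average": W = (delta1 + delta2) / 2; since both inputs are
                      hollow, diag(W) is automatically 0.
    impute="missing": only diag(W) = 0 is used; the off-diagonal entries
                      of W are flagged as missing in the mask.
    """
    n = delta1.shape[0]
    if impute == "average":
        w = (delta1 + delta2) / 2.0
        w_used = np.ones((n, n), dtype=bool)
    elif impute == "missing":
        w = np.zeros((n, n))
        w_used = np.eye(n, dtype=bool)
    else:
        raise ValueError("impute must be 'average' or 'missing'")
    m = np.block([[delta1, w], [w.T, delta2]])
    on = np.ones((n, n), dtype=bool)
    used = np.block([[on, w_used], [w_used.T, on]])
    return m, used

def impute_test_rows(u1, v2):
    """Rows of dissimilarities from y_1 and y_2 to all 2n training points,
    with the unobserved halves imputed as v_1 = u_2 = (u_1 + v_2) / 2."""
    mid = (u1 + v2) / 2.0
    return np.concatenate([u1, mid]), np.concatenate([mid, v2])

def dist_matrix(x):
    return np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
n = 5
delta1 = dist_matrix(rng.standard_normal((n, 3)))
delta2 = dist_matrix(rng.standard_normal((n, 4)))
m_avg, used_avg = omnibus_matrix(delta1, delta2, impute="average")
m_mis, used_mis = omnibus_matrix(delta1, delta2, impute="missing")
row1, row2 = impute_test_rows(rng.random(n), rng.random(n))
```

A weighted raw stress criterion with weights given by `used` would then restrict the optimization to fidelity and commensurability terms in the "missing" case.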

Obviously, the choice of $W$ is key for this joint optimization. Also, note that weights can be incorporated into the MDS optimization criterion; this weighting can become quite elaborate, but in its simplest form it yields a more general tradeoff between fidelity and commensurability via $w(\epsilon_{f_1} + \epsilon_{f_2}) + (1 - w)\,\epsilon_{c_{12}}$.


4  Illustrative  Simulation 

In  this  section  we  present  an  illustrative  Dirichlet  simulation  which  helps  to  elucidate  when  and 
why  our  joint  optimization  methodology  is  superior  to  either  version  of  separate  optimization. 

4.1  Dirichlet  Product  Model 

We describe a probability model with parameters $p$, $q$, $r$, $a$, and $K$.

Let $\Xi_k = S^{p+q}$, $k = 1, 2$. Here the simplex $S^p$ encodes “signal” and the simplex $S^q$ encodes “noise”. That is, on $S^p$ we let $\gamma_i \sim \mathrm{Dirichlet}(\mathbf{1})$ and mutually independent $X^1_{ik} \sim \mathrm{Dirichlet}(r\gamma_i + \mathbf{1})$ (signal, as in Section 1.5) while on $S^q$ we let $X^2_{ik} \sim \mathrm{Dirichlet}(\mathbf{1})$ (pure noise). For $a \in [0, 1]$, let $X_{ik} = [(1 - a) X^1_{ik}, a X^2_{ik}]$ - the concatenation of (weighted) signal and noise dimensions. The resultant distribution for $(X_{i1}, \dots, X_{iK})$ is denoted by $F_{p,q,r,a,K}$ and $F_{p,q,r,a,K \mid \gamma_1, \dots, \gamma_n}$ denotes the distribution conditional on the location of the $\gamma_i$.
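A sampler for this product model is short to write down. The sketch below assumes NumPy; the function name and the array layout (conditions along the first axis) are hypothetical choices made for illustration.

```python
import numpy as np

def sample_product_model(n, p, q, r, a, rng, K=2):
    """Draw n matched K-tuples from the Dirichlet product model.

    Returns gamma (n x (p+1)) and x with shape (K, n, p + q + 2): each row is
    the concatenation of (1 - a) * signal (p + 1 coordinates) and
    a * noise (q + 1 coordinates).
    """
    gamma = rng.dirichlet(np.ones(p + 1), size=n)
    x = np.empty((K, n, p + q + 2))
    for k in range(K):
        # signal: X^1_ik ~ Dirichlet(r * gamma_i + 1), independent across k
        signal = np.vstack([rng.dirichlet(r * g + 1.0) for g in gamma])
        # noise: X^2_ik ~ Dirichlet(1), unrelated to gamma_i
        noise = rng.dirichlet(np.ones(q + 1), size=n)
        x[k] = np.hstack([(1.0 - a) * signal, a * noise])
    return gamma, x

rng = np.random.default_rng(0)
gamma, x = sample_product_model(n=50, p=3, q=3, r=100.0, a=0.1, rng=rng)
print(x.shape)
```

Setting `a = 0` removes the noise coordinates' contribution entirely, while `a` near 1 leaves almost no signal.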

4.2  Testing 

For each of $n_{mc}$ Monte Carlo replicates ($n_{mc} = 1000$ in the simulations), we generate $n$ matched pairs according to the Dirichlet product model distribution $F_{p,q,r,a,K=2}$ by first generating $\gamma_1, \dots, \gamma_n$ and then, conditional on the collection $\{\gamma_i\}$, generating the matched pair $(X_{i1}, X_{i2})$. Embeddings are defined for each of the three competing methodologies based on this matched training data. For each test datum under $H_0$, one new $\gamma$ is generated, a matched pair is generated, out-of-sample embedding is performed, and the statistic $T = d(\tilde{y}_1, \tilde{y}_2)$ is calculated; this is repeated $s$ times independently ($s = 1000$ in the simulations) and the critical value $c_\alpha$ for the allowable Type I error level $\alpha$ is determined based on the Monte Carlo estimate of the null distribution of $T$. Then unmatched pairs are generated, out-of-sample embedding is performed, and the statistic $T$ is calculated for test data under $H_A$; this provides an estimate of the conditional power $P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A, \gamma_1, \dots, \gamma_n]$.

We perform $n_{mc}$ Monte Carlo replicates to integrate out the $\gamma_1, \dots, \gamma_n$, yielding comparative power estimates. We also investigate conditional power for particular collections $\{\gamma_i\}$, in order to better understand precisely when and why our joint optimization methodology is superior to either version of separate optimization.
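Once the test statistics under the null and the alternative have been simulated, the critical value and power estimate reduce to an empirical quantile and an exceedance frequency. A minimal sketch, assuming NumPy and its default quantile interpolation; the function names and the toy inputs are illustrative only.

```python
import numpy as np

def critical_value(t_null, alpha):
    """Empirical (1 - alpha) quantile of the simulated null statistics."""
    return float(np.quantile(t_null, 1.0 - alpha))

def estimated_power(t_null, t_alt, alpha):
    """Fraction of alternative statistics exceeding the critical value."""
    c_alpha = critical_value(t_null, alpha)
    return float(np.mean(np.asarray(t_alt) > c_alpha))

# Toy inputs standing in for simulated values of T under H_0 and H_A.
t_null = np.linspace(0.0, 1.0, 1001)
t_alt = np.array([0.5, 0.96, 0.99, 1.5])
c = critical_value(t_null, alpha=0.05)
power = estimated_power(t_null, t_alt, alpha=0.05)
print(c, power)
```

Averaging such conditional power estimates over independent replicates of the training topics gives the unconditional comparison described above.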

4.3  Results 

Figure 6 presents results from our Dirichlet product model with $K = 2$, $p = 3$, $q = 3$, $r = 100$, $a = 0.1$. The target dimension is $m = 2$. We use $n = 100$. The allowable Type I error level $\alpha$ is plotted against power $\beta = P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A]$. The results are based on $n_{mc} = 1000$ Monte Carlo replicates with $s = 1000$; the differences in the curves are statistically significant. In this case, jofc with $W = (\Delta_1 + \Delta_2)/2$ is superior to both pom and cca.


Figure 6: Dirichlet product model simulation results plotting the Type I error level $\alpha$ against power $\beta = P[d(\tilde{y}_1, \tilde{y}_2) > c_\alpha \mid H_A]$, indicating that jofc is superior to both pom and cca. See text for description.
4.4  Analysis

The  Dirichlet  product  model  is  designed  specifically  to  illustrate  when  and  why  jofc  is  superior 
to  both  pom  and  cca  in  terms  of  fidelity  and  commensurability. 

If $q$ is large with respect to the target dimensionality $m$, then with high probability cca will
identify an $m$-dimensional subspace in the “noise” simplex $S^q$ with spurious correlation. This
phenomenon requires only that $a > 0$. In this event, the out-of-sample embedding will produce
arbitrary $y_1$ and $y_2$, even under $H_0$. Thus the null distribution of the test statistic will be
inflated by these spurious correlations. If the allowable Type I error level $\alpha$ is smaller than the
probability of inflation, then the power of the cca method will be negatively affected.

If $a$ is small and $m < p$, then with high probability the $m$-dimensional subspaces identified by
the MDS step will come from the “signal” simplex $S^p$. If $m < p$, then with positive probability,
these two subspaces, identified separately in pom, will be geometrically incommensurate (see
Figure 7). Thus the null distribution of the test statistic will be inflated by these incommensurate
cases. If the allowable Type I error level $\alpha$ is smaller than the probability of inflation, then the
power of the pom method will be negatively affected.

For  large  q  and  small  a,  the  two  phenomena  described  above  occur  in  the  same  model.  The 
jofc  method  is  not  susceptible  to  either  phenomenon:  incorporating  fidelity  into  the  objective 
function  obviates  the  spurious  correlation  phenomenon,  and  incorporating  commensurability  into 
the objective function obviates the geometric incommensurability phenomenon. Thus we can
establish that, for a range of Dirichlet product model distributions, jofc is superior to both pom
and cca.

Theorem 3: Let $m \in \{1, \ldots, \min\{p-1, q\}\}$, $a \in (0, 1/2)$, and $r \in (0, \infty)$. Then for
large $q$, small $a$, and large $r$, there exists allowable Type I error level $\alpha > 0$ such that the
Dirichlet product model distribution $F_{p,q,r,a,K=2}$ with target dimensionality $m$ yields power
$\beta_{jofc} > \max\{\beta_{pom}, \beta_{cca}\}$, where power $\beta = P[d(y_1, y_2) > c_\alpha \mid H_A]$ for the various testing
methodologies jofc, pom, and cca.

Proof: Let $b_1$ denote the probability that cca suffers from the spurious correlation
phenomenon, and let $b_2$ denote the probability that pom suffers from the geometric
incommensurability phenomenon. Then $q \gg p$ implies that cca suffers from the spurious correlation
phenomenon with high probability and thus $b_1 \approx 1$ and $\beta_{cca} \approx \alpha$. For $a \approx 0$ and $r$ sufficiently large,
jofc and pom identify approximately the same embeddings except for the cases in which pom
suffers from the incommensurability phenomenon. Thus the null distribution of $T = d(y_1, y_2)$
for jofc is approximately point mass at zero while the null distribution of $T$ for pom has $b_2$ mass
$\gg 0$. Hence $\alpha \approx b_2/2$ yields $\beta_{jofc} \approx 1$ while $\beta_{pom} \approx 1/2$.
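
The last step of the proof can be illustrated numerically with a toy model. Everything below is an assumption made for illustration only: the value $b_2 = 0.2$, the uniform law used for inflated values of $T$, and the choice to give $T$ under $H_A$ that same law for both methods. Under those assumptions, the pom critical value at level $\alpha = b_2/2$ lands near the median of the inflated values, so roughly half of the alternative draws exceed it, while the jofc critical value stays at zero.

```python
import numpy as np

rng = np.random.default_rng(1)
s = 200_000        # number of Monte Carlo draws (hypothetical)
b2 = 0.2           # hypothetical probability that pom is incommensurate
alpha = b2 / 2     # the Type I error level used in the proof

def inflated(size):
    # Hypothetical law of T far from zero: used for incommensurate null cases
    # and, by assumption, for T under H_A.
    return rng.uniform(0.5, 1.5, size)

def power(null_T, alt_T, level):
    c = np.quantile(null_T, 1.0 - level)   # estimated critical value c_alpha
    return float(np.mean(alt_T > c))

null_jofc = np.zeros(s)                                      # point mass at zero
null_pom = np.where(rng.random(s) < b2, inflated(s), 0.0)    # mass b2 far from zero
alt_T = inflated(s)                                          # T under H_A (toy assumption)

power_jofc = power(null_jofc, alt_T, alpha)   # critical value is 0, so every draw rejects
power_pom = power(null_pom, alt_T, alpha)     # critical value near the inflated median
```

In this toy setting `power_jofc` equals 1 and `power_pom` is close to 1/2, matching the approximate values stated in the proof.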

Delving into our simulation results via investigation of conditional power
$P[d(y_1, y_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$, it is apparent that the superiority of jofc is indeed due to
occurrences of the phenomena described above: individual Monte Carlo replicates (particular
selections of the $\gamma_i$, essentially) are identified in which the spurious correlation phenomenon
causes poor performance for cca or the incommensurability phenomenon causes poor performance
for pom but in which jofc is unaffected.

We  note  that  the  Dirichlet  product  model  introduced  here  as  an  aid  in  understanding  when 
and  why  jofc  is  superior  to  both  pom  and  cca  does  in  fact  (loosely)  model  general  high-dimensional 
real data scenarios: many dimensions consisting mostly of noise along with a few signal
dimensions.
Figure 7: Idealization of the incommensurability phenomenon: for a symmetric collection
$\{\gamma_1, \gamma_2, \gamma_3, \gamma_4\}$ in the simplex $S^3$, all four of the facet projections have the same fidelity and are
geometrically incommensurable with one another.
4.5 Gaussian Model


A  Gaussian  model,  analogous  to  the  Dirichlet  product  model  investigated  above,  is  constructed 
here  to  provide  a  sense  of  the  generality  of  models  with  many  dimensions  consisting  mostly  of 
noise  along  with  a  few  signal  dimensions. 

We consider $p$-dimensional means $\mu_i \overset{iid}{\sim} \mathcal{N}(0, I_p)$, $i = 1, \ldots, n$, analogous to the $\gamma_i$ from the
Dirichlet model. Matchedness arises from independent $X^1_{ik} \sim \mathcal{N}(\mu_i, r^{-1} I_p)$, $i = 1, \ldots, n$,
$k = 1, \ldots, K$, for $r \in (0, \infty)$; as $r$ increases, the degree of matchedness increases. As before, we have
$q$-dimensional “noise” vectors $X^2_{ik} \overset{iid}{\sim} \mathcal{N}(0, I_q)$. Again, for $a \in [0, 1]$, $X_{ik} = [(1-a) X^1_{ik}, a X^2_{ik}]$
represents the concatenation of (weighted) signal and noise dimensions. As with the Dirichlet
product model, both the spurious correlation phenomenon and the geometric incommensurability
phenomenon are present in this Gaussian model.
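
A draw from this Gaussian model can be sketched as below. This is an illustrative reading of the definitions in this section (signal variance $1/r$, independent standard-normal noise, weights $1-a$ and $a$); the function name `sample_gaussian_model` and the array layout are assumptions, not part of the original presentation.

```python
import numpy as np

def sample_gaussian_model(n, p, q, r, a, K=2, rng=None):
    """Return (mu, X) with mu of shape (n, p) and X of shape (K, n, p + q)."""
    rng = np.random.default_rng() if rng is None else rng
    # Means mu_i ~ N(0, I_p), shared by all K conditions (this is what makes pairs matched).
    mu = rng.standard_normal((n, p))
    # Signal X^1_{ik} ~ N(mu_i, I_p / r): larger r gives tighter matching across conditions.
    signal = mu[None, :, :] + rng.standard_normal((K, n, p)) / np.sqrt(r)
    # Noise X^2_{ik} ~ N(0, I_q), independent across observations and conditions.
    noise = rng.standard_normal((K, n, q))
    # X_{ik} = [(1 - a) X^1_{ik}, a X^2_{ik}]: weighted concatenation of signal and noise.
    X = np.concatenate([(1.0 - a) * signal, a * noise], axis=2)
    return mu, X

mu, X = sample_gaussian_model(n=100, p=3, q=3, r=100.0, a=0.1,
                              rng=np.random.default_rng(0))
```

The parameter values in the usage line mirror those quoted for the Dirichlet simulation and are used here only as an example.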

Figure  8  presents  simulation  results  for  this  Gaussian  model,  entirely  analogous  to  those 
depicted  in  Figure  6. 


Figure 8: Gaussian model simulation results plotting the Type I error level $\alpha$ against power
$\beta = P[d(y_1, y_2) > c_\alpha \mid H_A]$, indicating jofc is superior to both pom and cca, entirely analogous to those
presented for the Dirichlet product model in Figure 6.
5 Experimental Results

5.1  Testing 

A collection of documents $\{x_{i1}\}_{i=1}^{n}$ is taken from the English Wikipedia, corresponding to
the directed 2-neighborhood of the document “Algebraic Geometry.” This yields $n = 1382$ and,
through Wikipedia’s own 1-1 correspondence, the associated French documents $\{x_{i2}\}_{i=1}^{n}$. For
dissimilarity matrices $\Delta_k$, $k = 1, 2$, we use the Lin & Pantel discounted mutual information
[25, 26] and cosine dissimilarity $\delta_k(x_{ik}, x_{jk}) = 1 - (x_{ik} \cdot x_{jk}) / (\|x_{ik}\|_2 \|x_{jk}\|_2)$.
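
The cosine dissimilarity above can be computed for a whole collection at once. The sketch below assumes each document has already been turned into a nonzero feature vector (one row of `X`); the feature construction itself, including the discounted mutual information weighting of [25, 26], is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def cosine_dissimilarity_matrix(X):
    """Pairwise 1 - cos(x_i, x_j) for the rows of X (n documents by d features)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    if np.any(norms == 0):
        raise ValueError("cosine dissimilarity is undefined for an all-zero row")
    U = X / norms[:, None]          # unit-length rows
    D = 1.0 - U @ U.T               # 1 - (x_i . x_j) / (||x_i|| ||x_j||)
    np.fill_diagonal(D, 0.0)        # remove round-off on the diagonal
    return np.clip(D, 0.0, 2.0)     # keep values in the valid range [0, 2]

# Tiny example: rows 0 and 2 are parallel, row 1 is orthogonal to both.
D = cosine_dissimilarity_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
```

Applying this separately to the English and French feature matrices would give one dissimilarity matrix per condition.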

Our results are obtained by repeatedly randomly holding out four documents - two matched
pairs - and identifying the embeddings via cca, pom, and jofc based on the remaining $n = 1380$
matched pairs. The two sets of held-out matched pairs are used as $y_1$ and $y_2$, via out-of-sample
embedding, to estimate the null distribution of the test statistic $T = d(y_1, y_2)$. This allows
us to estimate critical values for any specified Type I error level. Then the two sets of held-out
unmatched pairs are used as $y_1$ and $y_2$, via out-of-sample embedding, to estimate power.
Target dimensionality $m$ is determined by the Zhu and Ghodsi automatic dimensionality selection
method [27], resulting in $m = 6$ for this data set.

Figure  9  plots  the  allowable  Type  I  error  level  against  power.  These  experimental  results 
indicate  that  jofc  is  superior  to  both  pom  and  cca,  and  are  entirely  analogous  to  the  simulation 
results  presented  above. 

5.2  Ranking 

Here we consider a ranking task in which matched training data exists in disparate spaces $\Xi_1$
and $\Xi_2$, but test observation $y_2$ will be observed in space $\Xi_2$. The task is to find the match for
$y_2$ amongst a candidate collection $\mathcal{C} = \{y_{11}, \ldots, y_{z1}\} \subset \Xi_1$ of $z > 1$ possibilities. Using the
training set of matched observations, we identify the embeddings via cca, pom, and jofc, and
out-of-sample embedding then yields embedded versions of $y_2$ and $\mathcal{C} = \{y_{11}, \ldots, y_{z1}\}$. The rank $r^*$ of the one true
match to $y_2$ amongst the candidate collection $\mathcal{C}$ in terms of $\{d(y_{i1}, y_2)\}_{i=1}^{z}$ is our measure of
performance; $r^* = 1$ represents perfect performance, $r^* = z/2$ represents chance, and $r^* = z$ is
the worst possible.
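
The performance measure $r^*$ can be computed directly from the embedded points. The sketch below assumes Euclidean distance in the common space and resolves ties in favor of the true match; both choices, and the function name, are assumptions for illustration.

```python
import numpy as np

def rank_of_true_match(y2, candidates, true_index):
    """Rank r* of the true match among the candidates, ordered by distance to y2."""
    y2 = np.asarray(y2, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(candidates - y2, axis=1)   # d(y_{i1}, y_2) for i = 1, ..., z
    # r* = 1 + number of candidates strictly closer to y2 than the true match.
    return int(1 + np.sum(d < d[true_index]))

# Three embedded candidates on a line and an embedded test point near the second one.
cands = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
r_star = rank_of_true_match([0.9, 0.0], cands, true_index=1)
```

With $z$ candidates the returned value lies in $\{1, \ldots, z\}$, so it can be compared against the chance level $z/2$ mentioned above.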

For this experiment we consider a different collection of Wikipedia documents: all
English/Persian (Farsi) matched pairs (matched, again, through Wikipedia’s own 1-1
correspondence) for which both documents in the pair contain at least 500 total words and at least 100
distinct  words.  There  are  2448  such  pairs.  (The  word-count  restrictions  are  to  ensure  that  the 
documents  are  legitimate  articles,  rather  than  “stubs”  -  place-holders  for  future  articles  on  the 
topic.) 

Figures 10 and 11 present notched boxplot experimental results wherein we repeatedly hold
out $z = 1000$ matched pairs from the training set. (Recall that non-overlapping notches indicate
a statistically significant difference of medians.) Figure 10 depicts $r^*$ as a function of target
dimension $m$ for jofc (gray) and pom (white). Performance improves for both methods as $m$
increases from 5 to 25, with jofc superior. Performance levels off after $m = 30$ (and degrades
significantly for $m > 50$). Figure 11 depicts difference in ranks, $r^*_{pom} - r^*_{jofc}$; differences greater
than 0 indicate jofc superiority.
Figure 9: Experimental results on English/French Wikipedia documents plotting the Type I error
level $\alpha$ against power $\beta = P[d(y_1, y_2) > c_\alpha \mid H_A]$, indicating jofc is superior to both pom and cca. See
text for description.


6  Discussion  and  Conclusions 

We have presented a complete methodological core for manifold matching via joint optimization
of fidelity and commensurability and comprehensive comparisons with either version of
separate optimization. Continuing research includes comparison with other standard competing
methodologies, variations and generalizations of our omnibus embedding methodology, and
further theoretical developments.

Here  we  discuss  a  few  of  the  most  pressing  issues. 

K  >  2  Conditions 

It  is  straightforward  to  generalize  the  omnibus  dissimilarity  matrix  M  to  the  case  of  K  >  2 
conditions. 

Pre-Scaling the $\Delta_k$

The scale of the various dissimilarities has been assumed to be consistent. For Dirichlet data, this
assumption is warranted; however, pre-scaling of the $\Delta_k$ prior to constructing $M$ is imperative
for the general case.

Figure 10: Comparative rank experimental results depicting the rank $r^*$ of the one true match to
test observation $y_2$ amongst the candidate collection $\mathcal{C}$ in terms of $\{d(y_{i1}, y_2)\}_{i=1}^{z}$ as a function of
target dimension $m$. For each $m \in \{5, 10, 15, \ldots, 50\}$, there are two boxplots. These results indicate
that jofc (gray) is superior to pom (white) on this data set. With $z = 1000$, both methods perform
much better than chance ($r^* = z/2$), although performance does not achieve perfection ($r^* = 1$). See
text for description.

MDS  Objective 

Our omnibus embedding methodology can be employed with MDS criteria other than raw stress;
the $\ell_2$ criterion provides direct correspondence to fidelity and commensurability. Weighted $\ell_2$ is
straightforward. Other MDS minimization objectives have been studied in depth, and should in
particular circumstances provide superior performance.

Imputation  of  W 

It seems reasonable under $H_0$ to set the diagonal elements $\delta_{k_1 k_2}(x_{i k_1}, x_{i k_2})$ of $W$ to zero. Recall,
however, that this is not necessarily “truth;” the Dirichlet setting of Section 1.5 with $r < \infty$ would
have non-zero elements for diag($W$). Still, this shrinkage of diag($W$) to zero seems reasonable.
However, there may be cases for which imputing non-zero values would be appropriate; for
example, if information is available suggesting that some matchings are unreliable, then it might
be advantageous to use larger values for these matchings.

Figure 11: Comparative rank experimental results depicting difference in ranks $r^*_{pom} - r^*_{jofc}$; differences
greater than 0 indicate jofc superiority. See text for description.

As for the off-diagonal elements of $W$, we have argued that either leaving them as missing data
unused in the subsequent optimization or letting $W = (\Delta_1 + \Delta_2)/2$ are reasonable suggestions.
We believe that more elaborate imputation should provide superior performance. In particular, it
seems clear that choosing $\lambda \in [0, 1]$ and setting $W = \lambda \Delta_1 + (1-\lambda) \Delta_2$ or $W = (\lambda \Delta_1^2 + (1-\lambda) \Delta_2^2)^{1/2}$
will be preferable in certain circumstances.
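
The imputation choices for $W$ discussed in this subsection can be sketched as follows. Two things here are assumptions rather than statements from this section: the block layout $M = [[\Delta_1, W], [W^T, \Delta_2]]$ used to assemble the omnibus matrix for $K = 2$, and the reading of $\Delta_k^2$ as an elementwise square. Function names are hypothetical.

```python
import numpy as np

def impute_W(D1, D2, lam=0.5, squared=False):
    """Cross-condition block W from within-condition dissimilarities D1 and D2."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if squared:
        # W = (lam * D1^2 + (1 - lam) * D2^2)^(1/2), taken elementwise (assumption).
        W = np.sqrt(lam * D1 ** 2 + (1.0 - lam) * D2 ** 2)
    else:
        # W = lam * D1 + (1 - lam) * D2; lam = 1/2 gives (D1 + D2) / 2.
        W = lam * D1 + (1.0 - lam) * D2
    np.fill_diagonal(W, 0.0)   # matched pairs: cross-condition dissimilarity set to zero
    return W

def omnibus_matrix(D1, D2, W):
    """Assumed K = 2 block layout of the omnibus dissimilarity matrix M."""
    return np.block([[np.asarray(D1, dtype=float), W],
                     [W.T, np.asarray(D2, dtype=float)]])

D1 = np.array([[0.0, 2.0], [2.0, 0.0]])
D2 = np.array([[0.0, 4.0], [4.0, 0.0]])
M = omnibus_matrix(D1, D2, impute_W(D1, D2))
```

Replacing the zero diagonal with larger values for matchings believed to be unreliable would be a one-line change to `impute_W`.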

Model  Selection:  The  Choice  of  Target  Dimensionality  m 

We have assumed throughout that $X = \mathbb{R}^m$ for some pre-specified target dimension $m$. First,
we note that, in general, embedding into target spaces other than Euclidean is possible and
sometimes productive. More pressing is the necessity, in many applications, for data-driven
choice of target dimension. This is in general a vexing model selection task - the bias-variance
trade-off. Of course, $m = 1$ generally induces significant model bias and $m = n - 1$ generally
admits excessive estimation variance, as characterized in [15] Figure 12.1. Many dimensionality
selection methods based on the principle of diminishing returns in terms of variance explained
are available - in Section 5.1 we made use of the method proposed in [27], and in 5.2 we presented
results  as  a  function  of  m.  A  dimensionality  selection  methodology  specifically  designed  for  use 
with  our  omnibus  embedding  methodology  is  of  significant  interest. 

One illustrative point in this regard is that the general commensurate-space approach
considered throughout this article - for all three approaches jofc, pom, and cca - adds a further
complication with respect to identification of optimal target dimension: the optimal target
dimension $m^*_k$ for the various $\Delta_k$ will not be the same. This adds to the degree of difficulty in
designing methods for identifying the optimal common-space target dimension $m^*$.

Learning the $\pi_k$

We have assumed that the maps $\pi_k$ from object space $\Xi$ to the conditional spaces $\Xi_k$ are fixed
(see Figure 1). Indeed, $\Xi$ and the $\pi_k$ have been treated as notional only. In some circumstances,
it may be possible to use performance analyses to glean information concerning the induced
conditional distributions and profitably adjust the $\pi_k$, in a manner analogous to fusion frames
[28].

Fast  Omnibus  Embedding 

Out-of-sample  embedding  of  test  data  precludes  re-learning  the  mappings  for  each  inference. 
More  importantly,  it  is  straightforward  to  make  a  version  of  our  omnibus  embedding  methodology 
fast  (O(n)).  Making  an  effective  fast  version  requires  numerous  methodological  choices  for  various 
stages  of  jofc. 

Commensurability Error vs Hausdorff Distance on $G_{p,m}$

In the simple setting of Euclidean spaces $\Xi_k$, the pom methodology yields two elements of the
Grassmann space $G_{p,m}$ of $m$-dimensional subspaces of $\mathbb{R}^p$. This space is a manifold under
the Hausdorff distance $2\sin(\theta/2)$, where $\theta$ is the canonical angle between subspaces [29]. Under
special conditions the Hausdorff distance between pom’s two subspaces and the commensurability
error between their respective embeddings are closely related.
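
The distance $2\sin(\theta/2)$ can be computed from bases of the two subspaces. The sketch below obtains canonical angles from the singular values of $Q_1^T Q_2$, where $Q_1$ and $Q_2$ are orthonormal bases, and takes $\theta$ to be the largest canonical angle; that choice of $\theta$ for $m > 1$ is an assumption, as are the function names. The inputs are assumed to have full column rank.

```python
import numpy as np

def canonical_angles(A, B):
    """Canonical (principal) angles between the column spaces of A and B, in radians."""
    Qa, _ = np.linalg.qr(np.asarray(A, dtype=float))   # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(np.asarray(B, dtype=float))   # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def hausdorff_distance(A, B):
    """2 sin(theta / 2), with theta the largest canonical angle (assumed convention)."""
    theta = canonical_angles(A, B).max()
    return 2.0 * np.sin(theta / 2.0)

# Two orthogonal lines in the plane: theta = pi / 2.
d_orth = hausdorff_distance([[1.0], [0.0]], [[0.0], [1.0]])
```

Plotting this quantity against the commensurability error across Monte Carlo replicates is one way to reproduce the kind of comparison described for Figure 12.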

See  Figure  12  for  a  first  example,  from  the  Dirichlet  product  model  simulation  presented 
in  Figure  6.  Each  point  in  Figure  12  represents  a  Monte  Carlo  replicate.  We  note  that  the 
Hausdorff  distance  between  pom’s  two  subspaces  and  the  commensurability  error  between  their 
respective  embeddings  are  strongly  correlated.  Furthermore,  the  red  points  represent  replicates 
for which the conditional power $P[d(y_1, y_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$ is low - predominantly those
replicates  for  which  Hausdorff  distance  and  commensurability  error  are  large.  This  demonstrates 
the  effect  of  the  incommensurability  phenomenon  on  pom.  The  jofc  embeddings  are  not  subject 
to  this  deleterious  phenomenon. 

Additional investigations concerning the superiority of jofc to pom due to the
incommensurability phenomenon involve this relationship between Hausdorff distance and commensurability
error. Significantly more involved investigations are required when, as is the case for proper
text document analysis, one uses a more appropriate dissimilarity (Hellinger distance, or more
generally $\alpha$-divergence) on the simplex.

Three-Way  MDS 

Three-way MDS (see, for instance, [11]) addresses a problem superficially similar to joint
optimization of fidelity and commensurability, in which a single configuration and two transformation
matrices are identified from two dissimilarity matrices $\Delta_1$, $\Delta_2$. It may be of interest to compare
and contrast our omnibus embedding methodology with various instantiations of three-way MDS
- particularly the identity model presented in [30].

Figure 12: Commensurability error and Hausdorff distance on the Grassmannian Manifold for our
Dirichlet product model simulation (Figure 6). Strong correlation is evident. Furthermore, the red
points represent replicates for which the conditional power $P[d(y_1, y_2) > c_\alpha \mid H_A, \gamma_1, \ldots, \gamma_n]$ is low -
predominantly those replicates for which Hausdorff distance and commensurability error are large.

6.1  Conclusions 

In  conclusion,  we  have  presented  an  omnibus  embedding  methodology  for  joint  optimization 
of  fidelity  and  commensurability  that  allows  us  to  address  the  manifold  matching  problem  by 
jointly  identifying  embeddings  of  multiple  spaces  into  a  common  space.  Such  a  joint  embedding 
facilitates  statistical  inference  in  a  wide  array  of  disparate  information  fusion  applications.  We 
have investigated this methodology in the context of simple statistical inference tasks, and
compared and contrasted with competing fidelity-only and commensurability-only methodologies,
demonstrating the superiority of our joint optimization.

We  have  focused  on  a  simple  setting  and  simple  choices  for  various  methodological  options. 
Many  variations  and  generalizations  are  possible,  but  the  presentation  here  provides  the  core 
methodological  instantiation. 